Random forest

Random forest (or random forests) is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees. The algorithm for inducing a random forest was developed by Leo Breiman[1] and Adele Cutler, and "Random Forests" is their trademark. The term came from random decision forests, first proposed by Tin Kam Ho of Bell Labs in 1995. The method combines Breiman's "bagging" idea with the random selection of features, introduced independently by Ho[2][3] and by Amit and Geman[4], in order to construct a collection of decision trees with controlled variation.

The selection of a random subset of features is an example of the random subspace method, which, in Ho's formulation, is a way to implement stochastic discrimination[5] proposed by Eugene Kleinberg.

Learning algorithm

Each tree is constructed using the following algorithm (a code sketch follows the list):

  1. Let the number of training cases be N, and the number of variables in the classifier be M.
  2. The number m of input variables used to determine the decision at a node of the tree is given; m should be much less than M.
  3. Choose a training set for this tree by choosing N times with replacement from all N available training cases (i.e. take a bootstrap sample). Use the rest of the cases (the "out-of-bag" cases) to estimate the error of the tree by predicting their classes.
  4. For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set.
  5. Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier).
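
The following Python sketch illustrates these steps for a single forest. It is a minimal illustration under stated assumptions, not a reference implementation: the function names (gini, best_split, build_tree, grow_forest), the use of Gini impurity as the split criterion, and the default m = sqrt(M) are choices made for the example; the only dependency is numpy.

    # Minimal sketch of steps 1-5 above; names and split criterion are illustrative.
    import numpy as np

    def gini(y):
        """Gini impurity of a label vector (assumed criterion for 'best split')."""
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def best_split(X, y, feature_ids):
        """Search only the m candidate features for the lowest weighted impurity."""
        best = None
        for j in feature_ids:
            for t in np.unique(X[:, j]):
                left = X[:, j] <= t
                if left.all():                      # degenerate split: everything goes left
                    continue
                score = (left.sum() * gini(y[left]) +
                         (~left).sum() * gini(y[~left])) / len(y)
                if best is None or score < best[0]:
                    best = (score, j, t)
        return best                                 # (impurity, feature index, threshold) or None

    def build_tree(X, y, m, rng):
        """Grow a tree to purity; at each node only m randomly chosen features are tried (step 4)."""
        if len(np.unique(y)) == 1:                  # pure node: stop (no pruning, step 5)
            return {"leaf": y[0]}
        feature_ids = rng.choice(X.shape[1], size=m, replace=False)
        split = best_split(X, y, feature_ids)
        if split is None:                           # no usable split: majority-vote leaf
            values, counts = np.unique(y, return_counts=True)
            return {"leaf": values[np.argmax(counts)]}
        _, j, t = split
        left = X[:, j] <= t
        return {"feature": j, "threshold": t,
                "left": build_tree(X[left], y[left], m, rng),
                "right": build_tree(X[~left], y[~left], m, rng)}

    def grow_forest(X, y, n_trees=50, m=None, seed=0):
        """Steps 1-3: each tree gets its own bootstrap sample of the N cases."""
        rng = np.random.default_rng(seed)
        N, M = X.shape
        m = m or max(1, int(np.sqrt(M)))            # common heuristic for m << M (an assumption)
        forest = []
        for _ in range(n_trees):
            idx = rng.integers(0, N, size=N)        # N draws with replacement: bootstrap sample
            forest.append(build_tree(X[idx], y[idx], m, rng))
        return forest

Because each tree is grown on its own bootstrap sample and is restricted to a different random feature subset at every node, the trees in the ensemble are decorrelated; this is the "controlled variation" mentioned in the introduction.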

For prediction a new sample is pushed down the tree. It is assigned the label of the training samples in the terminal node it ends up in. This procedure is iterated over all trees in the ensemble, and the mode (majority vote) of the trees' labels is reported as the random forest prediction.
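
Continuing the sketch above (predict_tree and predict_forest are illustrative names, not from any library), prediction walks each tree from the root to a leaf and then takes the majority vote:

    from collections import Counter

    def predict_tree(tree, x):
        """Push a single sample x down one tree until it reaches a leaf."""
        while "leaf" not in tree:
            tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
        return tree["leaf"]

    def predict_forest(forest, x):
        """Collect one vote per tree and return the majority class."""
        votes = [predict_tree(tree, x) for tree in forest]
        return Counter(votes).most_common(1)[0][0]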

Features and advantages

The advantages of random forest are:

Disadvantages

Visualization

In order to form an intuitive visualization of the model-space represented by a random forest, a dataset consisting of 200 random points (100 green points and 100 red points) was created. The green points were drawn from a Gaussian distribution with a centroid at (0,1), and the red points were drawn from a Gaussian distribution with a centroid at (1,0). In both cases, the variance was circular with an average radius of 1.

A random forest model consisting of 50 trees was trained on this data. The purity of the color indicates the proportion of the 50 trees that voted in agreement. Significant overfitting can be observed in this visualization.

For contrast, a logistic regression model (which is somewhat less prone to overfitting) was also trained on the same data.

(Typically, random forest is best-suited for use with categorical features, but continuous features were used in this illustration because they were easier to visualize.)
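
A comparable experiment can be reproduced with scikit-learn. The sketch below is a stand-in for whatever software produced the original figures; the random seed, the unit standard deviation used to approximate the circular variance, and the use of RandomForestClassifier and LogisticRegression are all assumptions made for the example.

    # Two 100-point Gaussian classes and a 50-tree forest, as described above.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    green = rng.normal(loc=(0, 1), scale=1.0, size=(100, 2))   # centroid (0, 1)
    red = rng.normal(loc=(1, 0), scale=1.0, size=(100, 2))     # centroid (1, 0)
    X = np.vstack([green, red])
    y = np.array([0] * 100 + [1] * 100)

    forest = RandomForestClassifier(n_estimators=50).fit(X, y)
    logreg = LogisticRegression().fit(X, y)

    point = np.array([[0.5, 0.5]])
    # Averaged tree probabilities; with fully grown trees this is close to the
    # fraction of trees voting for each class (the "purity" shown as color above).
    print(forest.predict_proba(point))
    print(logreg.predict_proba(point))              # smoother output from the linear model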

See also

References

  1. ^ Breiman, Leo (2001). "Random Forests". Machine Learning 45 (1): 5–32. doi:10.1023/A:1010933404324. 
  2. ^ Ho, Tin Kam (1995). "Random Decision Forests". Proceedings of the 3rd International Conference on Document Analysis and Recognition. pp. 278–282. http://cm.bell-labs.com/cm/cs/who/tkh/papers/odt.pdf. 
  3. ^ Ho, Tin Kam (1998). "The Random Subspace Method for Constructing Decision Forests". IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (8): 832–844. doi:10.1109/34.709601. http://cm.bell-labs.com/cm/cs/who/tkh/papers/df.pdf. 
  4. ^ Amit, Y.; Geman, D. (1997). "Shape quantization and recognition with randomized trees". Neural Computation 9 (7): 1545–1588. doi:10.1162/neco.1997.9.7.1545. http://www.cis.jhu.edu/publications/papers_in_database/GEMAN/shape.pdf. 
  5. ^ Kleinberg, Eugene (1996). "An Overtraining-Resistant Stochastic Modeling Method for Pattern Recognition". Annals of Statistics 24 (6): 2319–2349. doi:10.1214/aos/1032181157. MR1425956. http://kappa.math.buffalo.edu/aos.pdf. 
  6. ^ Caruana, Rich; Karampatziakis, Nikos; Yessenalina, Ainur (2008). "An empirical evaluation of supervised learning in high dimensions". Proceedings of the 25th International Conference on Machine Learning (ICML). http://icml2008.cs.helsinki.fi/papers/632.pdf. 
  7. ^ Segal, Mark R. (April 14, 2004). Machine Learning Benchmarks and Random Forest Regression. Center for Bioinformatics & Molecular Biostatistics. http://repositories.cdlib.org/cbmb/bench_rf_regn/. 
  8. ^ Deng, H.; Runger, G.; Tuv, E. (2011). "Bias of importance measures for multi-valued attributes and solutions". Proceedings of the 21st International Conference on Artificial Neural Networks (ICANN). http://enpub.fulton.asu.edu/hdeng3/MultiICANN2011.pdf. 

Commercial implementation

Open source implementations

External links